set.seed(0911)
library(ggplot2)
library(gridExtra)
library(cowplot)
library(plotly) # interactif plot
library(ggfortify) # diagnostic plot
library(arm) # binnedplot diagnostic plot in GLM
library(knitr)
library(dplyr)
library(tidyverse)
library(tidymodels)
library(lmtest) # LRtest
library(survey) # Wald test
library(vcdExtra) # deviance test
library(rsample) # for data splitting
library(glmnet)
library(nnet) # multinom, glm
library(caret)
library(ROCR)
ggplot2::theme_set(ggplot2::theme_light()) # Set the graphical theme

The optimal prediction of \(Y\) conditionally on \(x\) is the regression function \(h(x)=\text{E}[Y\vert x]\). In the previous chapter, we assumed \(h(x)\) is linear with respect to \(x\): \[h(x)=\text{E}[Y\vert x]=x^T\beta, \quad\text{where}\quad Y=x^T\beta+\epsilon, \text{ with } \epsilon\sim\mathcal{N}(0,\sigma^2).\]
😕 This linear assumption, however, limits us to continuous responses and does not handle categorical outcomes, thus restricting our ability to address classification tasks.
To broaden our approach, we introduce models that maintain a linear predictor \(\,\,\eta(x)=x^T\beta\) but incorporate a link function \(g\) to map the conditional expectation of \(Y\) to the linear predictor. Specifically, we assume \[ g(\text{E}_\beta[Y\vert x])=x^T\beta, \] where \(g(\cdot)=h^{-1}\) is known as the link function. Consequently, we can express the conditional expectation as \[ \text{E}[Y\vert x]=g^{-1}(\eta(x))=g^{-1}(x^T\beta). \] This formulation, which includes the linear predictor with a link function, allows for greater flexibility in modeling responses of various types, including binary and categorical outcomes.
METHOD
1.\(\,\) Select the conditional distribution of \(Y\vert x\). Choose a probability distribution for \(Y\vert x\) from the natural exponential family.
2.\(\,\) Define the linear predictor. Set \(\eta(x):=x^T\beta\) and select an appropriate link function. Often, the canonical link function is used for simplicity and interpretability.
3.\(\,\) Estimate the parameter \(\beta\). Obtain an estimate \(\widehat{\beta}_n\) of the unknown parameter \(\beta\) using a sample \(\mathcal{D}_n=\{(Y_i,x_i)\}_{i=1,\cdots,n}\). Then, the estimated mean response is given by: \[g^{-1}(X\widehat{\beta}_n)\quad\text{where}\quad X=(x_1,\cdots,x_n)^T.\]
A random variable \(Y\) is said to have a probability density with respect to a dominating measure \(\nu\), denoted by \(f_{\theta,\phi}\), and to belong to the natural exponential family \(\mathcal{F}_\theta^{Nat}\) if \(f_{\theta,\phi}\) can be expressed as \[f_{\theta,\phi}(y)=\exp\left(\frac{y\theta-b(\theta)}{\phi}+c(y,\phi)\right),
\] where \(b(\cdot)\) and \(c(\cdot)\) are known and differentiable functions.
If \(\,Y\) admits a density belonging to the natural exponential family \(\mathcal{F}_\theta^{Nat}\), then \[\text{E}[Y]=b'(\theta)\quad\text{and}\quad\text{Var}[Y]=b''(\theta)\,\phi.\]
Example. Let \(Y\sim\mathcal E(\lambda)\); then the probability density function is given by: \[f(y)=\lambda e^{-\lambda y}{\Large{\mathbb{1}}}_{y\geq 0}=\exp\left(\ln(\lambda)-\lambda y\right){\Large{\mathbb{1}}}_{y\geq 0}. \] Identification: \[ \theta=-\lambda,\quad \phi=1,\quad b(\theta)=-\ln(\lambda)=-\ln(-\theta), \quad c(y,\phi)=0. \] Thus, the following relationships hold:
\(b'(\theta)=-1/\theta=1/\lambda=\text{E}[Y]\) and
\(b''(\theta)=1/\theta^2=1/\lambda^2=\text{Var}[Y]/\phi=\text{Var}[Y]\) (since \(\phi=1\))
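These identities can be checked numerically; a minimal sketch in R, taking numerical derivatives of \(b(\theta)=-\ln(-\theta)\) at \(\theta=-\lambda\) (the value \(\lambda=2\) is an arbitrary choice for illustration):

```r
# Numerically verify E[Y] = b'(theta) and Var[Y] = b''(theta) * phi (phi = 1)
# for the exponential distribution Exp(lambda), where theta = -lambda.
lambda <- 2
theta  <- -lambda
b      <- function(th) -log(-th)   # cumulant function b(theta)

h  <- 1e-4
b1 <- (b(theta + h) - b(theta - h)) / (2 * h)             # central difference ~ b'(theta)
b2 <- (b(theta + h) - 2 * b(theta) + b(theta - h)) / h^2  # ~ b''(theta)

c(mean = b1, variance = b2)   # close to 1/lambda and 1/lambda^2
```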
In practice, models often employ the canonical link function for simplicity and ease of interpretability.
Suppose \(Y\) is a random variable with a density belonging to the natural exponential family \(\mathcal{F}_\theta^{Nat}\), such that \[\text{E}_\theta[Y]=b'(\theta)=\mu;
\] then the function \(g(\mu)=(b')^{-1}(\mu)\) is defined as the canonical link.
Example. Let \(Y\sim\mathcal E(\lambda)\). We have determined that \(b'(\theta)=-1/\theta\) so the canonical link function is given by: \[ g(\mu)=(b')^{-1}(\mu)=-1/\mu \] \(\Rightarrow\) This corresponds to the (canonical) inverse link function.
The table below summarizes some common distributions along with their respective canonical link functions and names. Here, we denote by \(\mu(x)=\text{E}[Y\vert x]=g^{-1}(\eta(x))=g^{-1}(x^T\beta).\)
| Law | \(\mathcal{B}(p)\)–\(\mathcal{B}\text{in}(N,p)\) | \(\mathcal{P}(\lambda)\) | \(\text{Neg}\mathcal{B}\text{in}(k,p)\) | \(\mathcal{N}(\mu,\sigma)\) | \(\Gamma(\alpha,\beta)\) |
|---|---|---|---|---|---|
| Canonical Link | \(\text{logit}(\mu)\) | \(\log(\mu)\) | \(\log_k(\mu)\) | \(\quad\mu\quad\) | \(\quad-\frac{1}{\mu}\quad\) |
| Link Name | Logit | Log | \(\text{Log}_k\) | Identity | Reciprocal |
where \[
\text{logit}(\mu)=\log\left(\frac{\mu}{N-\mu}\right)\quad\text{and}\quad\log_k(\mu)=\log\left(\frac{\mu}{k+\mu}\right)
\]
📝 Remarks.
- When using a “logit link,” the model is referred to as logistic regression. Similarly, with a “logarithmic link,” we have Poisson regression.
- Other, non-canonical link functions are also used in practice. For example, the probit link: \(g(\mu)=\Phi^{-1}(\mu)\), where \(\Phi(\cdot)\) is the standard Gaussian distribution function, and the complementary log-log (cloglog) link: \(g(\mu)= \log(-\log(1-\mu))\), with \(\mu\in(0,1)\).
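In R, non-canonical links are selected through the `family` argument of `glm()`. A small sketch on simulated binary data (the data and coefficients here are illustrative, not from the chapter's dataset):

```r
# Fit the same binary model under the logit, probit and cloglog links
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(0.5 + x))   # data generated under a logit model

fit_logit   <- glm(y ~ x, family = binomial(link = "logit"))
fit_probit  <- glm(y ~ x, family = binomial(link = "probit"))
fit_cloglog <- glm(y ~ x, family = binomial(link = "cloglog"))

# The link only changes g, not the Bernoulli likelihood family; compare by AIC
sapply(list(logit = fit_logit, probit = fit_probit, cloglog = fit_cloglog), AIC)
```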
Ⓡ To illustrate the
differences between logistic and linear regression, consider a binary
response variable \(Y\), where \(Y\) takes values in \(\{0,1\}\). In this context, \(Y\) represents whether a payment
default occurs.
Linear Regression Model We begin with a simple linear regression model to predict the probability of default.
Ⓡ The following
code visualizes a linear regression fit for predicting \(Y\) as a function of
balance:
p1 <- ISLR::Default %>%
mutate(prob = ifelse(default == "Yes", 1, 0)) %>%
ggplot(aes(balance, prob)) +
geom_point(alpha = .15) +
geom_smooth(method = "lm") +
ggtitle("Linear regression model fit") +
xlab("Balance") +
ylab("Probability of Default")
ggplotly(p1)

📈 The plot suggests that the linear model may not be suitable for this binary outcome, as it can predict probabilities outside the \([0,1]\) range. Therefore, selecting an appropriate distribution for \(Y\vert x\) is crucial; a Bernoulli distribution for \(Y\) conditioned on \(x\) is more appropriate in this case, with: \[ p(x)=P(Y=1\vert x)\text{ and } \mu(x)=\text{E}[Y\vert x]=p(x). \]
Logistic Regression Model To correctly model the relationship, we choose the canonical logit link function: \[g(\mu(x))=g(p(x))=\text{logit}(p(x))=\log\left(\frac{p(x)}{1-p(x)}\right).\] With \(\eta(x)=x^T\beta\) and \(\widehat{\beta}_n\) as an estimator of \(\beta\) from \(n\) observations, we estimate \(\text{E}[Y\vert x]=p(x)\) using: \[\widehat{p}(x)=g^{-1}(\widehat{\eta}(x))=g^{-1}(x^T\widehat{\beta}_n)=\frac{e^{x^T\widehat{\beta}_n}}{1+e^{x^T\widehat{\beta}_n}}. \]
Ⓡ The code below compares the logistic model fit:
p2 <- ISLR::Default %>%
mutate(prob = ifelse(default == "Yes", 1, 0)) %>%
ggplot(aes(balance, prob)) +
geom_point(alpha = .15) +
geom_smooth(method = "glm", method.args = list(family = "binomial")) +
ggtitle("Logistic regression model fit") +
xlab("Balance") +
ylab("Probability of Default")
subplot(ggplotly(p1), ggplotly(p2), nrows = 1)

📈 A typical classification rule is to assign the predicted value \(\widehat Y_i=1\) if \(\widehat{p}_i=\widehat{p}(x_i)>s\), with \(s=0.5\) commonly used as a threshold.
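The thresholding rule can be sketched as follows (illustrative simulated data, not the Default dataset):

```r
# Turn fitted probabilities into class predictions with threshold s = 0.5
set.seed(10)
x <- rnorm(250)
y <- rbinom(250, 1, plogis(-1 + 2 * x))
fit   <- glm(y ~ x, family = binomial)
p_hat <- fitted(fit)                 # estimated p(x_i) on the training data
y_hat <- ifelse(p_hat > 0.5, 1, 0)   # classification rule with s = 0.5
mean(y_hat == y)                     # training accuracy
```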
Let \(Y=(Y_1,\cdots,Y_n)^\top\) represent the response vector, and define the design matrix as: \[X = \begin{pmatrix} x_{11}&\cdots &x_{1p} \\ \vdots& &\vdots\\ x_{n1}&\cdots &x_{np} \end{pmatrix}= (X_1,\cdots,X_p)=\begin{pmatrix} x_{1}^T\\ \vdots\\ x_{n}^T \end{pmatrix},\] where each \(X_j\), \(j=1\cdots,p\) represents one of the explanatory variables.
Let \(\mathcal{L}(\beta)\) denote the log of the likelihood function. Given that the \(Y_i\) are independent, we have: \[\mathcal{L}(\beta)=\sum_{i=1}^n\log f_{\theta_i,\phi}(Y_i)=\sum_{i=1}^n\mathcal{L}_i(\beta), \] where \(\mathcal{L}_i(\beta)\) is the contribution of the \(i^\text{th}\) observation \((Y_i,x_i)\) to the log of the likelihood: \[\mathcal{L}_i(\beta)=\ell(Y_i,\theta_i,\phi,\beta)=\log f_{\theta_i,\phi}(Y_i)=\frac{Y_i\theta_i-b(\theta_i)}{\phi}+c(Y_i,\phi).\]
Proof. We have
\[\begin{align*} \mathcal{L}(\beta)&=\sum_{i=1}^n \log\left(f_{Y_i}(Y_i)\right)=\sum_{i=1}^n \frac{Y_i\theta_i-b(\theta_i)}{\phi}+c(Y_i,\phi)=\sum_{i=1}^n\mathcal{L}_i(\beta) \end{align*}\] Additionally: \[\begin{align} \label{un} \tag{eq: Prop2-1} &\text{E}[Y_i|x_i]=\mu_i=b'(\theta_i)\Rightarrow \theta_i=[b']^{-1}(\mu_i)\\ \label{trois} \tag{eq: Prop2-2} &\eta_i=x_i^\top\beta\\ \label{deux} \tag{eq: Prop2-3} &g(\text{E}[Y_i|x_i])=g(\mu_i)=\eta_i\Rightarrow \mu_i=g^{-1}(\eta_i):=h(\eta_i)\\ \label{quatre} \tag{eq: Prop2-4} &\text{Var}[Y_i|x_i]=b''(\theta_i)\phi\\ \label{cinq} \tag{eq: Prop2-5} &x_i^\top\beta=\sum_{j=1}^p x_{ij}\beta_j \end{align}\]
We express the per-observation contribution to the likelihood as: \[\mathcal{L}_i(\beta)=\frac{Y_i\theta_i-b(\theta_i)}{\phi}+c(Y_i,\phi).\] By the chain rule: \[ \frac{\partial \mathcal{L}_i(\beta)}{\partial\beta_j}=\frac{\partial \mathcal{L}_i(\beta)}{\partial\theta_i}\frac{\partial\theta_i}{\partial\mu_i} \frac{\partial\mu_i}{\partial\eta_i}\frac{\partial\eta_i}{\partial\beta_j}. \] Calculating each term individually:
\[\begin{align*} \frac{\partial \mathcal{L}_i(\beta)}{\partial\theta_i}&=\frac{Y_i-b'(\theta_i)}{\phi}\\ \frac{\partial\theta_i}{\partial\mu_i}&=\frac{1}{b''(\theta_i)}=\frac{\phi}{\text{Var}[Y_i|x_i]}\quad\text{since by }\eqref{un}\quad \mu_i=b'(\theta_i)\\ \frac{\partial\mu_i}{\partial\eta_i}&=h'(\eta_i)\quad\text{by} \quad \eqref{deux}\\ \frac{\partial\eta_i}{\partial\beta_j}&=x_{ij}\quad\text{by} \quad \eqref{trois} \end{align*}\] Combining these terms, we have: \[\frac{\partial\mathcal{L}_i(\beta)}{\partial\beta_j}=\frac{Y_i-b'(\theta_i)}{\phi}\frac{\phi}{\text{Var}[Y_i|x_i]} h'(\eta_i)x_{ij}.\] Thus: \[\frac{\partial\mathcal{L}_i(\beta)}{\partial\beta_j}=\frac{Y_i-b'(\theta_i)}{\text{Var}[Y_i|x_i]} h'(\eta_i)x_{ij}. \] \(\square\)
With the canonical link defined, it is possible to simplify the expressions in the likelihood equations. This leads to a more efficient formulation for estimating parameters, as shown in the following proposition.
For the canonical link, the likelihood
equations are simplified: \[
\sum_{i=1}^n\frac{(Y_i-\mu_i)x_{i,j}}{\phi} =0,\quad j=1,\cdots,p.
\]
Proof. We have \[\begin{align} \label{six} \tag{eq: Prop3-1} &\text{E}[Y_i|x_i]=h(x_i^\top\beta)=b'(\theta_i)\\ &h^{-1}\left(\text{E}[Y_i|x_i]\right)=x_i^\top\beta=g\left(\text{E}[Y_i|x_i]\right)\nonumber\\ \label{sept} \tag{eq: Prop3-2} &h^{-1}(\cdot)=g(\cdot)\quad\text{(link function)} \end{align}\] From this, it follows that: \[\eqref{six}\Rightarrow\theta_i=[b']^{-1}\circ h(x_i^\top\beta)\] \[\eqref{sept}\Rightarrow\theta_i=[b']^{-1}\circ g^{-1}(x_i^\top\beta)\]
Canonical Link Function Framework: If \(h^{-1}=g=[b']^{-1}\), then: \[ [b']^{-1}\circ h=[b']^{-1}\circ g^{-1}=\text{Identity}. \] Hence: \[ \theta_i=x_i^\top\beta=\eta_i,\quad\mu_i=b'(\theta_i)=\text{E}[Y_i|x_i],\quad h(\cdot)=b'(\cdot)\quad\text{and}\quad h'(\cdot)=b''(\cdot). \] The likelihood equations are such that: \[ \sum_{i=1}^n\frac{Y_i-\mu_i}{ \text{Var}[Y_i]}h'(\eta_i) x_{i,j}=0,\quad j=1,\cdots,p. \] Substituting: \[ \sum_{i=1}^n\frac{Y_i-b'(\theta_i)}{ \text{Var}[Y_i]}b''(\theta_i) x_{i,j}=0,\quad j=1,\cdots,p. \] Since \(\text{Var}[Y_i]=b''(\theta_i)\phi\), we have: \[ \sum_{i=1}^n\frac{Y_i-b'(\theta_i)}{ \phi} x_{i,j}=0,\quad j=1,\cdots,p. \] \(\square\)
📝 Remark.
The likelihood equations are often transcendental, meaning they cannot be solved explicitly. Instead, numerical optimization methods such as Newton-Raphson or the Iteratively Reweighted Least Squares (IRLS) algorithm are essential.
Example. In logistic regression, \(Y_i | x_i \sim \text{Bernoulli}(\pi_i)\), where: \[ \mu_i = \pi_i = \frac{\exp(x_i^\top \beta)}{1 + \exp(x_i^\top \beta)} \quad \text{and} \quad \phi = 1. \]
The likelihood equations reduce to: \[ \sum_{i=1}^n \left( Y_i - \frac{\exp(x_i^\top \beta)}{1 + \exp(x_i^\top \beta)} \right)x_{ij} = 0, \quad \forall j = 1, \ldots, p. \]
Using the IRLS/Newton-Raphson framework, the logistic regression parameters are iteratively estimated with updates that adapt to the curvature of the likelihood function.
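We can verify numerically that `glm()` solves these equations: at the fitted \(\widehat\beta\), the score \(X^\top(Y-\widehat\mu)\) vanishes. A sketch on simulated data (the coefficients below are illustrative):

```r
# At the logistic MLE, the likelihood equations X^T (y - p_hat) = 0 hold numerically
set.seed(2)
n <- 300
x <- rnorm(n)
X <- cbind(1, x)                          # design matrix with intercept
y <- rbinom(n, 1, plogis(-0.3 + 0.8 * x))
fit   <- glm(y ~ x, family = binomial)
score <- t(X) %*% (y - fitted(fit))       # one equation per coefficient
max(abs(score))                           # essentially zero
```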
[Optimization Reminder] The objective is to find the root of a function \(f\), meaning to solve for \(x\) such that \(f(x)=0\).
When a direct solution is challenging, \(f(x)\) can be locally approximated by its tangent at a point \(x^{(k)}\). Using a first-order Taylor expansion, we write: \[f(x)\approx f(x^{(k)}) + (x-x^{(k)})f'(x^{(k)}).\] The root of this linear approximation corresponds to where the tangent intersects the x-axis: \[ f(x^{(k)}) + (x - x^{(k)})f'(x^{(k)}) = 0 \quad \Rightarrow \quad x^{(k+1)} = x^{(k)} - \frac{f(x^{(k)})}{f'(x^{(k)})}. \]
This iterative method, known as Newton’s method, generates a sequence: \[ x^{(k+1)} = x^{(k)} - \frac{f(x^{(k)})}{f'(x^{(k)})}. \]
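A minimal R implementation of this iteration, finding \(\sqrt{2}\) as the root of \(f(x)=x^2-2\):

```r
# Newton's method: x_{k+1} = x_k - f(x_k) / f'(x_k)
newton <- function(f, fprime, x0, tol = 1e-12, max_iter = 50) {
  x <- x0
  for (k in seq_len(max_iter)) {
    step <- f(x) / fprime(x)
    x    <- x - step
    if (abs(step) < tol) break   # stop when the update is negligible
  }
  x
}

root <- newton(function(x) x^2 - 2, function(x) 2 * x, x0 = 1)
root   # converges to sqrt(2) in a handful of iterations
```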
The Newton-Raphson method is a powerful approach for optimizing a differentiable objective function, such as the log-likelihood \(\mathcal{L}(\beta)\). For generalized linear models, it iteratively updates the parameter \(\beta\) until convergence.
📝 Remarks.
- Efficiency of updates: Each iteration involves solving a linear system of equations. Efficient computation of the Hessian and its inverse is critical for large-scale problems.
- Limitations: Newton-Raphson can struggle if \(\mathcal{L}(\beta)\) is not concave or if the Hessian is ill-conditioned. Modifications like step-size adjustments or quasi-Newton methods (e.g., BFGS) are often used to address these issues.
[Choices of \(A_k\)] The update step \(\beta^{(k+1)}=\beta^{(k)}-A_k\nabla\mathcal{L}(\beta^{(k)})\) involves the matrix \(A_k\), which determines the step size and direction. There are two common choices:
IRLS (Iteratively Reweighted Least Squares): Uses an approximation of the expected Hessian: \[ A_k = \left[\mathbb{E}(\nabla^2 \mathcal{L}(\beta^{(k)}))\right]^{-1}. \] IRLS is particularly useful in the canonical link case, as it simplifies the iterative process.
Newton-Raphson Algorithm (NRA): Directly uses the exact Hessian: \[ A_k = \left[\nabla^2 \mathcal{L}(\beta^{(k)})\right]^{-1}. \] This choice ensures rapid convergence for smooth, concave objective functions.
[Matrix Representation for the Canonical Case]
Gradient of the Log-Likelihood. For the canonical case, the gradient of the log-likelihood is given by: \[ \nabla \mathcal{L}(\beta) = \frac{1}{\phi} X^\top (Y - \mu(\beta)), \] where \(X\) is the \(n \times p\) design matrix, \(Y\) is the \(n \times 1\) response vector and \(\mu(\beta)\) is the mean response vector: \[ \mu(\beta) = \begin{bmatrix} b'(x_1^\top \beta) \\ \vdots \\ b'(x_n^\top \beta) \end{bmatrix}. \]
Hessian of the Log-Likelihood. For all \(j,k=1,\cdots,p\), \[ \left[\mathcal{H}(\mathcal{L})(\beta)\right]_{j,k}=\frac{\partial^2\mathcal{L}(\beta)}{\partial\beta_j\partial\beta_k}=\frac{\partial}{\partial\beta_k}\left[ \frac{1}{\phi} \sum_{i=1}^n(Y_i-b' (x_i^\top\beta))x_{ij} \right] =-\frac{1}{\phi}\sum_{i=1}^n b''(x_i^\top\beta)x_{ij} x_{ik}. \] Then, the Hessian is the second derivative matrix: \[ \mathcal{H}(\mathcal{L})(\beta) = -\frac{1}{\phi} X^\top W(\beta) X, \] where \(W(\beta)\) is an \(n \times n\) diagonal matrix with: \[ W_{ii}(\beta) = b''(x_i^\top \beta), \quad \forall i = 1, \ldots, n. \]
Parameter update rule. Combining the gradient and Hessian, the Newton-Raphson update becomes: \[ \begin{array}{ll} \beta^{(k+1)}&= \beta^{(k)}-\left(\mathcal{H}(\mathcal{L})(\beta^{(k)})\right)^{-1} \nabla\mathcal{L}(\beta^{(k)})\\ &=\beta^{(k)}+\phi \left[ X^\top W(\beta^{(k)})X \right]^{-1} \frac{1}{\phi} X^\top (Y-\mu(\beta^{(k)}))\\ &=\beta^{(k)}+\left[ X^\top W(\beta^{(k)})X \right]^{-1} X^\top (Y-\mu(\beta^{(k)})) \end{array} \] This can be rewritten as:
\[ \begin{array}{ll} \beta^{(k+1)}&= \left[ X^\top W(\beta^{(k)})X \right]^{-1}X^\top W(\beta^{(k)}) \left[X\beta^{(k)}+W^{-1}(\beta^{(k)})(Y-\mu(\beta^{(k)}) ) \right]\\ &= \left[ X^\top W(\beta^{(k)})X \right]^{-1}X^\top W(\beta^{(k)}) Z^{(k)}, \end{array}\] where \(Z^{(k)}\) is the working response vector: \[ Z^{(k)} = X \beta^{(k)} + W(\beta^{(k)})^{-1}(Y - \mu(\beta^{(k)})). \]
📝 Remarks.
- In this framework, the IRLS algorithm reformulates the problem of maximizing the log-likelihood (or minimizing the divergence criterion) at each iteration as a weighted linear regression on a transformed response \(Z^{(k)}\).
- The process iteratively refines \(\beta\) until convergence, with updates adapting to the curvature of the objective function via \(W(\beta^{(k)})\).
- Concavity: The Newton-Raphson algorithm is particularly effective when the likelihood is concave, since concavity guarantees that any stationary point is the global maximum.
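The IRLS recursion above can be implemented in a few lines for the logistic (canonical) case and checked against `glm()`; the simulated data below are illustrative:

```r
# Minimal IRLS for logistic regression: repeated weighted least squares on Z^(k)
set.seed(3)
n <- 500
x <- rnorm(n)
X <- cbind(1, x)
y <- rbinom(n, 1, plogis(1 - 2 * x))

beta <- c(0, 0)                      # starting value beta^(0)
for (k in 1:25) {
  eta <- drop(X %*% beta)
  mu  <- plogis(eta)                 # mu_i = b'(eta_i) for the canonical link
  w   <- mu * (1 - mu)               # W_ii = b''(eta_i)
  z   <- eta + (y - mu) / w          # working response Z^(k)
  beta <- drop(solve(t(X) %*% (w * X), t(X) %*% (w * z)))  # weighted LS update
}

fit <- glm(y ~ x, family = binomial)
max(abs(beta - coef(fit)))           # agrees with glm's Fisher scoring
```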
Denote by \(I(\beta_0)\) the Fisher information matrix evaluated at \(\beta_0\). Under certain regularity conditions, the maximum likelihood estimator \(\widehat{\beta}_n^{ML}\), defined by \[ \widehat{\beta}_n^{ML} := \underset{\beta}{\arg\max} \, \sum_{i=1}^n \frac{Y_i x_i^T \beta - b(x_i^T \beta)}{\phi}, \] has the following asymptotic properties:
- Consistency: \(\widehat{\beta}_n^{ML}\overset{\mathbb{P}_{\beta_0}}{\longrightarrow}\beta_0\);
- Asymptotic normality: \(\sqrt{n}\left(\widehat{\beta}_n^{ML}-\beta_0\right)\overset{\mathcal{D}}{\longrightarrow}\mathcal{N}\left(\boldsymbol{0}_p, I^{-1}(\beta_0)\right)\).
Moreover, \[ I^{1/2}(\widehat{\beta}_n^{ML}) \sqrt{n} \left( \widehat{\beta}_n^{ML} - \beta_0 \right) \overset{\mathcal{D}\text{ under } \mathbb{P}_{\beta_0}}{\longrightarrow} \mathcal{N}(\boldsymbol{0}_p, \textbf{I}_p). \]
📝 Remarks.
- Fisher Information Matrix. The matrix \(I(\beta_0)\) is crucial because it quantifies the amount of information the observed data provides about the parameter \(\beta_0\). A larger Fisher information value indicates more reliable estimates.
- Consistency. The first property indicates that as the sample size \(n \to \infty\), the MLE \(\widehat{\beta}_n^{ML}\) converges in probability to the true parameter value \(\beta_0\). This guarantees that with a sufficiently large sample, the MLE will provide an accurate estimate.
- Asymptotic Normality. The second property implies that, for large \(n\), the distribution of the MLE \(\widehat{\beta}_n^{ML}\) approximates a multivariate normal distribution centered around the true parameter value \(\beta_0\). This provides a foundation for constructing confidence intervals and hypothesis tests.
Taking advantage of the asymptotic normality, we can construct asymptotic confidence intervals. Indeed, from Theorem 1 , it follows that \[ T_j = \left[ \sqrt{n} I^{1/2}(\widehat{\beta}^{ML}) \right]_{jj} \left( \widehat{\beta}_j^{ML} - \beta_j \right) \overset{\mathcal{D}}{\longrightarrow} \mathcal{N}(0,1). \] Thus, for \(\alpha \in (0,1)\), \[ \lim_{n \rightarrow +\infty} P\left( |T_j| \leq q_{1 - \alpha/2}^{\mathcal{N}(0,1)} \right) = 1 - \alpha, \] where \(q_{1 - \alpha/2}^{\mathcal{N}(0,1)}\) is the \((1 - \alpha/2)\)-quantile of the standard normal distribution. An asymptotic confidence interval of level \(1 - \alpha\) for \(\beta_j\) is given by: \[ IC_j := \left[ \widehat{\beta}_j^{ML} - \frac{q_{1 - \alpha/2}^{\mathcal{N}(0,1)}}{\left[ \sqrt{n} I^{1/2}(\widehat{\beta}^{ML}) \right]_{jj}}, \; \widehat{\beta}_j^{ML} + \frac{q_{1 - \alpha/2}^{\mathcal{N}(0,1)}}{\left[ \sqrt{n} I^{1/2}(\widehat{\beta}^{ML}) \right]_{jj}} \right]. \]
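In practice the Fisher information is estimated from the fit, and the interval above is exactly what `confint.default()` computes from the coefficient table; a sketch on simulated data:

```r
# Wald-type 95% confidence intervals: estimate ± z_{0.975} * standard error
set.seed(4)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(0.2 + x))
fit <- glm(y ~ x, family = binomial)

est <- coef(fit)
se  <- sqrt(diag(vcov(fit)))        # asymptotic standard errors
ci  <- cbind(lower = est - qnorm(0.975) * se,
             upper = est + qnorm(0.975) * se)
ci

all.equal(unname(ci), unname(confint.default(fit)))  # TRUE
```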
For this chapter, we will use the binary.csv dataset, which contains information regarding the admissions of students to a new establishment.
The dataset includes:
- a binary response variable (`admit`), which is a factor with two levels: 0 (not admitted) and 1 (admitted);
- two continuous covariates: the Graduate Record Exam score (`gre`) and the Grade Point Average (`gpa`);
- the factor `rank`, which has four levels (1, 2, 3 and 4). Note that a rank 1 establishment is more prestigious than a rank 2 establishment.
Ⓡ To load and view the dataset, use the following code:
mydata<-read.csv("binary.csv",header=T,sep=",")
mydata$admit<-factor(mydata$admit)
mydata$rank<-factor(mydata$rank)
mydata %>% rmarkdown::paged_table()

# library(rsample)
set.seed(0911)
#Create a stratified split of the data, maintaining the proportion of classes
mydata_split <- initial_split(mydata, prop = 0.7, strata = "admit")
#mydata_split <- initial_split(mydata, prop = .7)
train <- training(mydata_split)
test <- testing(mydata_split)

Ⓡ You can also obtain a summary of the dataset with:
## admit gre gpa rank
## 0:191 Min. :300.0 Min. :2.420 1: 40
## 1: 88 1st Qu.:520.0 1st Qu.:3.135 2:107
## Median :580.0 Median :3.380 3: 89
## Mean :587.3 Mean :3.399 4: 43
## 3rd Qu.:660.0 3rd Qu.:3.670
## Max. :800.0 Max. :4.000
To avoid any confusion regarding the levels of the response variable, let us rename them:
levels(train$admit)[levels(train$admit) %in% c('0', '1')] <-c('No', 'Yes')
table(train$admit) %>% as.data.frame() %>% setNames(c("Admit", "Counts")) %>% kableExtra::kbl()

| Admit | Counts |
|---|---|
| No | 191 |
| Yes | 88 |
For the sake of simplicity, we can define:
📝 Remarks.
- The order of the levels in the response variable is significant.
- In machine learning terminology, the class `Y=1` is referred to as the positive class, while the class `Y=0` is termed the negative class.
- Typically, the reference level (the first level in the table) is assigned to the majority class, which is considered the negative class (here, `0` = No). If necessary, you can change the reference level using the following command:
Let us define the mathematical framework for our analysis of student admissions:
- a factor `rank`, which can take on \(J=4\) levels (1, 2, 3 and 4). Notably, an establishment with rank 1 is considered more prestigious than one with rank 2;
- a binary response variable (`admit`) with two modalities. Specifically, \(Y_{ij}=1\) corresponds to individual \(i\) from rank \(j\) being admitted (i.e., \(Y_{ij}=1\) = Yes), and \(Y_{ij}=0\) signifies non-admission (i.e., \(Y_{ij}=0\) = No);
- two continuous covariates: \(X^{(1)}\), the score `gre`, and \(X^{(2)}\), the Grade Point Average `gpa`.
Point Average (gpa).We can express the model in the following form: \[ Y_{ij}\vert x_{ij}\sim\mathcal{B}(p(x_{ij}))\quad\text{where}\quad x_{ij}^T\beta=\mu+\alpha_j+\beta_1 X_{ij}^{(1)}+\beta_2 X_{ij}^{(2)},\] with the expected value given by \[\text{E}[Y_{ij}\vert x_{ij}]=p(x_{ij})\quad\text{and}\quad p(x_{ij})=\frac{e^{x_{ij}^T\beta}}{1+e^{x_{ij}^T\beta}}.\] It is important to note that \(p(x_{ij})=\mathbb P(Y=1)= \mathbb P(Y=Yes)\).
Model Declaration. There are several approaches to declare a logistic regression model in Ⓡ . We will explore one of them here.
Using the glm() Function. The Generalized Linear Model (glm) function provides a flexible framework for modeling various types of data, including binomial outcomes. Here, we specify the response variable `admit` in relation to the covariates `rank`, `gpa`, and `gre`.
Ⓡ
model_glm <- glm(admit ~ rank + gpa + gre, family = "binomial", data = train)
##
## Call:
## glm(formula = admit ~ rank + gpa + gre, family = "binomial",
## data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.847985 1.396641 -2.755 0.005866 **
## rank2 -0.699451 0.385413 -1.815 0.069554 .
## rank3 -1.707940 0.426848 -4.001 6.3e-05 ***
## rank4 -1.810166 0.531629 -3.405 0.000662 ***
## gpa 0.883593 0.409500 2.158 0.030948 *
## gre 0.001780 0.001342 1.326 0.184769
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 347.84 on 278 degrees of freedom
## Residual deviance: 311.70 on 273 degrees of freedom
## AIC: 323.7
##
## Number of Fisher Scoring iterations: 4
1. Number of Fisher Scoring iterations The summary output indicates the number of iterations required for convergence. Fisher’s scoring algorithm is a variant of Newton’s method specifically designed for solving maximum likelihood problems numerically.
Ⓡ In this model, it took four iterations to achieve convergence.
## [1] 4
2. Coefficients We assess the significance of each coefficient individually using a Gaussian test. As established in Theorem 1, the following asymptotic relationship holds: \[ I^{1/2}(\widehat{\beta}_n^{ML})\sqrt{n}( \widehat{\beta}_n^{ML}-\beta_0)\overset{\mathcal{D}}{\longrightarrow}\mathcal{N}(\boldsymbol{0}_p,\textbf{I}_p). \]
Ⓡ The coefficients and their associated statistics can be extracted as follows:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.847984992 1.396641426 -2.755170 5.866159e-03
## rank2 -0.699451040 0.385413495 -1.814807 6.955360e-02
## rank3 -1.707940071 0.426847638 -4.001287 6.299879e-05
## rank4 -1.810165945 0.531628860 -3.404943 6.617787e-04
## gpa 0.883593421 0.409500429 2.157735 3.094844e-02
## gre 0.001779811 0.001342024 1.326214 1.847689e-01
Moreover, we can perform the Wald test to further evaluate the null hypothesis regarding the coefficients.
Consider the hypothesis test: \[
\mathcal{H}_0 \ : \ \beta_j=0, \ vs \
\mathcal{H}_1\ : \ \beta_j\neq 0.
\] Under certain assumptions and under \(\mathcal{H}_0\) \[
S:=n\left[
I(\widehat{\beta}^{ML})\right]_{jj}\left(\widehat{\beta}^{ML}_j\right)^2\overset{\mathcal{D}}{\longrightarrow}\chi_{1}^2.
\] For a fixed significance level \(\alpha \in (0, 1)\), the rejection region
is defined as: \[\left\{S\geq
q_{1-\alpha}^{\chi^2_{1}}\right\},\] where \(q_{1-\alpha}^{ \chi^2_{1}}\) is the
quantile of order \(1-\alpha\) from the
chi-squared distribution with 1 degree of freedom.
## Wald test for gre
## in glm(formula = admit ~ rank + gpa + gre, family = "binomial",
## data = train)
## F = 1.758843 on 1 and 273 df: p= 0.18588
## Wald test for gpa
## in glm(formula = admit ~ rank + gpa + gre, family = "binomial",
## data = train)
## F = 4.65582 on 1 and 273 df: p= 0.031821
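The outputs above come from the survey package's `regTermTest()` (loaded at the top of the chapter), which reports an F statistic. For a single coefficient, the Wald statistic is simply the squared z value from the summary table; a self-contained sketch on simulated data:

```r
# Wald test for one coefficient: S = (beta_hat / se)^2 ~ chi^2_1 under H0
set.seed(5)
x <- rnorm(300)
y <- rbinom(300, 1, plogis(0.3 * x))
fit <- glm(y ~ x, family = binomial)

z <- coef(summary(fit))["x", "z value"]
S <- z^2                                    # Wald statistic
p <- pchisq(S, df = 1, lower.tail = FALSE)  # asymptotic p-value
c(statistic = S, p.value = p)
```

Note that the chi-squared p-value equals the two-sided Gaussian p-value reported in the summary table, since \(z^2\sim\chi^2_1\) when \(z\sim\mathcal{N}(0,1)\).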
The Wald test can also be applied to categorical factors.
Consider the hypothesis test: \[
\begin{cases}
\mathcal{H}_0 : & \alpha_{(-1)}=(\alpha_2,\cdots,\alpha_J)^T=\boldsymbol{0}_{J-1},\\
\mathcal{H}_1 : & \alpha_{(-1)}\neq \boldsymbol{0}_{J-1}.
\end{cases}
\] Under certain assumptions and under \(\mathcal{H}_0\), \[S:=\left\|\sqrt n \,\text{I}^{1/2}\left(\widehat{\beta}_{(-1)}^{ML}\right)\widehat{\alpha}^{ML}_{(-1)}\right\|^2\overset{\mathcal{D}}{\longrightarrow}\chi_{J-1}^2.
\]
For a fixed significance level \(\alpha \in (0, 1)\), the rejection region is: \[\left\{S\geq q_{1-\alpha}^{\chi^2_{J-1}}\right\},\] where \(q_{1-\alpha}^{ \chi^2_{J-1}}\) is the quantile of order \(1-\alpha\) from the chi-squared distribution with \(J - 1\) degrees of freedom.
📝 Remark.
Note that for categorical variables, the Wald test has a different formulation due to the additional constraint (here \(\alpha_1 = 0\)).
Ⓡ Display Wald test
for categorial variable rank.
## Wald test for rank
## in glm(formula = admit ~ rank + gpa + gre, family = "binomial",
## data = train)
## F = 7.264522 on 3 and 273 df: p= 0.000105
3. Deviance To define deviance, we first introduce the concept of the saturated model: the model with as many parameters as observations, which fits the data perfectly, i.e. \(\widehat{\text{E}[Y_i\vert x_i]}=y_i\) for every observation.
Therefore, we compare \(\mathcal{L}\), the log of the likelihood of our model, with \(\mathcal{L}_{[m_{sat}]}\), the log of the likelihood of the saturated model \([m_{sat}]\).
Example.
If \(Y_i\vert x_i\sim\mathcal{B}(p(x_i))\), then for the saturated model \([m_{sat}]\), we have: \[\widehat{\text{E}[Y_i\vert x_i]}=\widehat{p}(x_i)=y_i.\] In this case, the log of the likelihood is zero \[\mathcal{L}_{[m_{sat}]}=\sum_{i=1}^n \log\left(\widehat{p}(x_i)^{y_i} (1-\widehat{p}(x_i))^{1-y_i}\right) =0 \]
If \(Y_i\vert x_i\sim\mathcal{B}\text{in}(n,p(x_i))\), then for the saturated model \([m_{sat}]\), we have: \[\widehat{\text{E}[Y_i\vert x_i]}=n\widehat{p}(x_i)=y_i.\]
Here, the log of the likelihood is not zero: \[\mathcal{L}_{[m_{sat}]}=\sum_{i=1}^n
\log\left(\binom{n}{y_i}(\widehat{p}(x_i))^{y_i}
(1-\widehat{p}(x_i))^{n-y_i}\right) \neq 0 .
\]
[Definition 3]
The deviance of a model \([m]\), defined with respect to the saturated model \([m_{sat}]\), is denoted \(\mathcal{D}_{[m]}\) and is equal to \[\mathcal{D}_{[m]}=2\left(\mathcal{L}_{[m_{sat}]}-\mathcal{L}_{[m]}\right)\geq0,\]
where \(\mathcal{L}_{[m_{sat}]}\) and \(\mathcal{L}_{[m]}\) represent the log likelihoods in the saturated model and in the model \([m]\), respectively.
📝 Remark.
It is evident that a higher deviance \(\mathcal{D}_{[m]}\) indicates a poorer fit for the model \([m]\).
Ⓡ The residual deviance corresponds to the value of the deviance of our model \([\texttt{mod}]\): \[\mathcal{D}_{[\texttt{mod}]}(\widehat{p})=2(\mathcal{L}_{[m_{sat}]}(\widehat{p})-\mathcal{L}_{[\texttt{mod}]}(\widehat{p}))=-2\mathcal{L}_{[\texttt{mod}]}(\widehat{p}),\] the last equality holding because \(\mathcal{L}_{[m_{sat}]}=0\) in the Bernoulli case.
## [1] 311.7008
Ⓡ The null deviance represents the value of the deviance of the null model (which includes only the intercept): \[ \mathcal{D}_{[\texttt{nul}]}(\widehat{p})=-2\mathcal{L}_{[\texttt{nul}]}(\widehat{p})=-2\mathcal{L}_{[\texttt{nul}]}(\overline{y}) \]
## [1] 347.8364
📈 The deviance of our model is significantly lower than the null deviance (the deviance of the model reduced to the intercept). This indicates that our model provides a better fit than the null model.
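Both quantities are available directly from the fitted object; a sketch on simulated Bernoulli data (for which the saturated log-likelihood is zero, so the residual deviance equals \(-2\mathcal{L}\)):

```r
# Residual and null deviance of a fitted binomial glm
set.seed(6)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-0.5 + x))
fit <- glm(y ~ x, family = binomial)

deviance(fit)        # residual deviance
fit$null.deviance    # deviance of the intercept-only model

# For Bernoulli data, L_[m_sat] = 0, hence D = -2 * logLik
all.equal(deviance(fit), -2 * as.numeric(logLik(fit)))
```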
We start with a Type I test, which evaluates the significance of each predictor in the model sequentially. The anova() function output shows the deviance reductions for each predictor added sequentially.
Ⓡ Here’s the summary:
📈 These results suggest that both
rank and gpa add value to the model, whereas
gre may not be a necessary predictor.
Next, we perform Type II tests using the drop1 function. This function removes each variable one at a time and evaluates the change in deviance to determine whether the model fit significantly worsens without that variable. This approach is useful to check whether each variable contributes significantly to the model.
⚠️ This method is not compatible with multinom models.
Ⓡ Here’s the drop1() output:
📈 Both Type I and Type II tests
highlight rank and gpa as significant
predictors, while gre appears unnecessary.
Based on the results above, we’ll explore different model
configurations to identify the best-fitting model for predicting
admit.
Ⓡ Here are the models:
model_glm<-glm(admit~rank+gpa+gre,family='binomial',data=train)
modF2<-glm(admit~rank+gre,family='binomial',data=train)
modF<-glm(admit~rank+gpa,family='binomial',data=train)
modNull<-glm(admit~1,family='binomial',data=train)

To compare these models, we can use the LRstats() function from the vcdExtra package, which provides an asymptotic Goodness-of-Fit test by deviance to check if a model \([mod]\) of size \(m\) is adequate to explain the data.
\[ \begin{cases} \mathcal{H}_0 :& \mathcal{D}_{[mod]}=0&\textbf{model is adequate}\\ \mathcal{H}_1 :& \mathcal{D}_{[mod]}>0&\textbf{model is inadequate}\\ \end{cases} \]
📝 Remarks.
- This test assesses if the model differs significantly from a saturated model.
- Under \(\mathcal{H}_0\), the test implies that there is no significant difference between the models.
[Proposition 6]
Under certain conditions and under \(\mathcal{H}_0\), \[\mathcal{D}_{[mod]}\overset{\mathcal{D}}{\longrightarrow}
\chi^2_{n-m},
\] where \(n-m\) represents the degrees of freedom. For a given significance level \(\alpha\in(0,1)\), the rejection region is:
\[\left\{\mathcal{D}_{[mod]}\geq
q_{1-\alpha}^{ \chi^2_{n-m}}\right\},\] where \(q_{1-\alpha}^{ \chi^2_{n-m}}\) is the quantile of order \(1-\alpha\) from the chi-squared distribution with \(n-m\) degrees of freedom.
Ⓡ Here’s the LRstats() output:
📈 Interpretation
- AIC and BIC values are lowest for `modF` and `model_glm`, suggesting they are the best-fitting models among those considered.
- The goodness-of-fit p-value for the full model (`model_glm`) is nearly significant (\(\approx\) 0.05), suggesting it fits reasonably well but may include unnecessary complexity.
- `modF` is the preferred model, as it balances model fit with simplicity, making it both interpretable and efficient for prediction.
To delve deeper, we compare nested models using a Likelihood Ratio (LR) test. This test evaluates whether adding parameters (moving from a simpler model to a more complex model) significantly improves fit.
[Proposition 7]
Consider two nested models \([m_0]\) and \([m_1]\), where (\([m_0]\subset [m_1]\)). The hypotheses are: \[ \begin{cases} \mathcal{H}_0 :& [m_0] &\text{ is adequate}\\ \mathcal{H}_1 :& [m_1] &\text{ is adequate}\\ \end{cases} \]
Under \(\mathcal{H}_0\), the LR test statistic is defined as: \[\text{LRtest}:=2(\mathcal{L}_{[m_1]}-\mathcal{L}_{[m_0]})=(\mathcal{D}_{[m_0]}-\mathcal{D}_{[m_1]})\overset{\mathcal{D}}{\longrightarrow} \chi^2_{m_1-m_0},\] where \(m_1-m_0\) represents the degrees of freedom. For a given significance level \(\alpha\in(0,1)\), the rejection region is: \[\left\{\text{LRtest}\geq q_{1-\alpha}^{ \chi^2_{m_1-m_0}}\right\},\] where \(q_{1-\alpha}^{ \chi^2_{m_1-m_0}}\) is the quantile of order \(1-\alpha\) from the chi-squared distribution with \(m_1-m_0\) degrees of freedom.
Ⓡ To perform this test, we can use the `lrtest` function from the `lmtest` package.
📈 The result confirms that
modF is preferable over model_glm.
Testing for Interactions. Lastly, we consider
whether interaction terms improve the model fit. We introduce an
interaction between gpa and rank:
📈 The test indicates that adding the
interaction term does not improve the model, so we maintain
modF as the preferred model.
Recall that, in our example, the positive class is `1=Yes`. The confusion matrix summarizes:

- tp. Correctly predicted positives (predicts `1=Yes` and actual is `1=Yes`).
- tn. Correctly predicted negatives (predicts `0=No` and actual is `0=No`).
- fp. Incorrectly predicted positives (predicts `1=Yes` and actual is `0=No`).
- fn. Incorrectly predicted negatives (predicts `0=No` and actual is `1=Yes`).

The counts in the matrix depend on the threshold \(s\) used to classify probabilities estimated by the model. Adjusting \(s\) changes the sensitivity and specificity of predictions, balancing the rate of TPs against FPs.
1. Confusion Matrix Structure
| Threshold \(s\) | \(Y_i=0\quad\) | \(Y_i=1\quad\) |
|---|---|---|
| \(\widehat Y_i=0\quad\) | tn | fn |
| \(\widehat Y_i=1\quad\) | fp | tp |
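A confusion matrix of this form can be produced with `table()`; here is a sketch with simulated stand-in probabilities and labels (`probs`, `y`, and `s` are hypothetical, not the chapter's `modF` output):

```r
# Building a confusion matrix at threshold s from predicted probabilities
set.seed(1)
probs <- runif(200)            # stand-in predicted probabilities
y     <- rbinom(200, 1, probs) # stand-in observed labels
s     <- 0.5

pred <- ifelse(probs >= s, 1, 0)
# rows: predicted class, columns: observed class
table(Predicted = pred, Observed = y)
```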
2. Different Thresholds
To examine how threshold \(s\) affects predictions, we display the predicted probabilities for `modF`. The following plots show the distribution of predicted probabilities split by the default threshold \(s=0.5\).
The confusion matrix for \(s=0.5\) is:
| Threshold by default \(s=0.5\) | \(Y_i=0\)=No | \(Y_i=1\)=Yes |
|---|---|---|
| \(\widehat Y_i=0\quad\) | 172 | 60 |
| \(\widehat Y_i=1\quad\) | 19 | 28 |
Now, modify the threshold to \(s=0.3\). Lowering \(s\) captures more instances in the positive class, increasing sensitivity.
The confusion matrix for \(s=0.3\) is:
| Threshold \(s=0.3\) | \(Y_i=0\)=No | \(Y_i=1\)=Yes |
|---|---|---|
| \(\widehat Y_i=0\quad\) | 124 | 30 |
| \(\widehat Y_i=1\quad\) | 67 | 58 |
Effect of the threshold. As the threshold decreases from \(s=0.5\) to \(s=0.3\): fn and tn decrease, while fp and tp increase.

🔍 How can we determine the optimal threshold? Choosing the right threshold requires an appropriate evaluation metric.
Evaluation metrics are essential for assessing the quality of a classification model. Each metric offers a unique perspective on performance, helping you choose the right model and threshold for your needs. Here are three key metrics, each answering a different question about model quality:
- Accuracy (`acc`). How often is the model right overall?
- Precision (`ppv`). When the model predicts a positive, how often is it correct?
- Recall (`tpr`). How effectively does the model identify all actual positives?

🤔 Accuracy answers the question: how often is the model right?
Accuracy (\(\text{acc}\)) measures the proportion of correct predictions out of the total predictions, providing a general sense of model effectiveness: \[ \text{acc}=\frac{\text{tp}+\text{tn}}{\text{tp}+\text{tn}+\text{fp}+\text{fn}} \]
This is the proportion of correct classifications. Here, for a threshold of \(s=0.5\), the accuracy is 0.7168459. However, adjusting the threshold \(s\) may improve accuracy further.
Let's visualize accuracy across thresholds to identify an optimal one:
📈 The plot shows how accuracy can improve by adjusting the threshold. For instance, the optimal threshold is \(s=0.5386559\), which yields a maximum accuracy of 0.734767. This adjustment raises accuracy from 0.7168459 to 0.734767, though it may affect other metrics.
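The threshold search itself is a one-liner over a grid; a sketch with simulated stand-in data (`probs` and `y` are hypothetical, not the chapter's `modF` predictions):

```r
# Accuracy as a function of the threshold s
set.seed(1)
probs <- runif(200)            # stand-in predicted probabilities
y     <- rbinom(200, 1, probs) # stand-in observed labels

accuracy <- function(s) mean((probs >= s) == y)

grid <- seq(0.01, 0.99, by = 0.01)
accs <- sapply(grid, accuracy)

s_opt <- grid[which.max(accs)]  # threshold maximizing accuracy
c(s_opt = s_opt, acc = max(accs))
```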
Updating the confusion matrix at this optimal threshold reveals changes:
| Threshold \(s=0.5386559\) | \(Y_i=0\)=No | \(Y_i=1\)=Yes |
|---|---|---|
| \(\widehat Y_i=0\quad\) | 182 | 66 |
| \(\widehat Y_i=1\quad\) | 9 | 22 |
📈 At this new threshold, fp and tp decrease while tn and fn increase.
Pros.
Cons.
📝 Remark.
In real-world applications with imbalanced classes, such as detecting missiles, global accuracy can be misleading. Missing an actual missile (false negative) is more critical than falsely predicting one, so accuracy alone may be insufficient. Instead, Precision and Recall provide a more reliable basis for choosing a threshold in these scenarios.
🤔 Precision answers the question: When the model predicts a positive, how often is it correct?
Precision, also known as Positive Predictive Value
(ppv), measures the accuracy of positive
predictions made by a classification model. It quantifies how often the
model correctly identifies instances of the positive class.
Specifically, precision is defined as the ratio of true positives to the
sum of true positives and false positives: \[
\text{prec}=\text{ppv}=\frac{\text{tp}}{\text{tp}+\text{fp}}
\]
Example. To illustrate this concept, consider the scenario of detecting spam in an email inbox, where “spam” is designated as the positive class.
In this case, the cost of a false positive is generally higher, making it crucial to minimize false positives. Therefore, maximizing precision becomes essential to ensure the model is reliable when predicting spam.
Pros.
Cons.
📝 Remark.
Note that we can also define the Negative Predictive Value (\(\text{npv}\)), which assesses the model’s performance on the negative class: the ratio of true negatives to the total predicted negatives \((\text{tn}+ \text{fn})\): \[ \text{npv}=\frac{\text{tn}}{\text{tn}+\text{fn}} \]
🤔 Recall answers the question: How effectively does the model identify all actual positives?
Recall, also known as Sensitivity or True Positive Rate (tpr), quantifies the ability of a classification model to correctly identify positive instances from the total actual positives in the dataset. It is defined as the ratio of true positives to the total number of actual positives:
\[
\text{rec}=\text{sens}=\text{tpr}=\frac{\text{tp}}{\text{tp}+\text{fn}}
\]
Example. To illustrate this concept, consider the case of identifying sick patients in a medical setting, where “sick” represents the positive class.
In this scenario, the cost of a false negative is particularly significant, as failing to identify a sick patient could have serious health consequences. Therefore, minimizing false negatives is essential, which directly translates to maximizing recall.
Pros.
Cons.
Prioritizes false negatives over false positives. Recall tends to treat false negatives as more costly than false positives, which might not align with all situations. While it aims to capture all positives, it may lead to an unacceptably high number of false positives if not managed carefully.
Potential for misleading performance. If the goal is to achieve “total recall,” a naive strategy would be to classify every instance as positive, resulting in 100% recall but an overwhelming number of false positives. This can render the model ineffective and diminish its practical utility.
📝 Remarks.
- The term “Sensitivity” is more commonly used in medical and biological research than in machine learning.
- Additionally, we can define the False Negative Rate (FNR), which measures the proportion of actual positives \((\text{tp}+\text{fn})\) that were not correctly identified: \[ \text{fnr}=\frac{\text{fn}}{\text{tp}+\text{fn}}=1-\text{rec}=1-\text{sens} \]
Visualize Precision and Recall at various threshold values to understand how adjusting the threshold impacts model performance.
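A base-R sketch of such a visualization, making the definitions explicit (`probs` and `y` are simulated stand-ins, not the chapter's model output):

```r
# Precision and Recall across thresholds
set.seed(1)
probs <- runif(200)            # stand-in predicted probabilities
y     <- rbinom(200, 1, probs) # stand-in observed labels

pr_at <- function(s) {
  pred <- as.integer(probs >= s)
  tp <- sum(pred == 1 & y == 1)
  fp <- sum(pred == 1 & y == 0)
  fn <- sum(pred == 0 & y == 1)
  c(precision = tp / (tp + fp), recall = tp / (tp + fn))
}

grid   <- seq(0.05, 0.95, by = 0.05)
curves <- t(sapply(grid, pr_at))
matplot(grid, curves, type = "l", lty = 1, col = c("blue", "red"),
        xlab = "threshold s", ylab = "metric")
legend("bottomleft", c("precision", "recall"),
       col = c("blue", "red"), lty = 1)
```

Packages such as `ROCR` provide ready-made versions of these curves; the point here is only to show how each point of the curves is computed.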
Take away. Precision is a suitable metric when you are more concerned about “being right” when assigning the positive class than “detecting them all.” Conversely, the Recall metric describes the model’s ability to predict the positive class when the actual result is positive.
In Summary.
In addition to Precision and Recall, several other important metrics provide a more comprehensive evaluation of classification models. Here, we introduce two key metrics: Specificity and the False Positive Rate.
Specificity (True Negative Rate, TNR) measures the proportion of actual negatives that are correctly identified by the model. It assesses the model’s ability to avoid false positives. Specifically, it is defined as the ratio of true negatives to the total number of actual negatives \((\text{tn}+\text{fp})\): \[ \text{spec}=\text{tnr}=\frac{\text{tn}}{\text{tn}+\text{fp}} \] High specificity indicates that the model is effective in identifying negative instances, which is particularly important in scenarios where the cost of false positives is high.
False Positive Rate (FPR) quantifies the proportion of actual negatives \((\text{tn}+\text{fp})\) that are incorrectly classified as positives. It provides insight into how often the model mistakenly labels a negative instance as positive: \[ \text{fpr}=\frac{\text{fp}}{\text{tn}+\text{fp}}=1-\text{tnr}=1-\text{spec} \] A low FPR indicates that the model is reliable in minimizing the misclassification of negative instances.
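As a quick numerical check, both quantities follow directly from the confusion-matrix counts; the numbers below are taken from the \(s=0.5\) table shown earlier (tn = 172, fn = 60, fp = 19, tp = 28):

```r
# Specificity and FPR from confusion-matrix counts (s = 0.5 table)
tn <- 172; fn <- 60; fp <- 19; tp <- 28

spec <- tn / (tn + fp)  # true negative rate
fpr  <- fp / (tn + fp)  # false positive rate, = 1 - spec

c(specificity = spec, fpr = fpr)
```

The specificity of about 0.9005 matches the value reported in the `confusionMatrix()` output below.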
This matrix provides a comprehensive overview of the model’s performance across different classifications:
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 172 60
## Yes 19 28
##
## Accuracy : 0.7168
## 95% CI : (0.6601, 0.7689)
## No Information Rate : 0.6846
## P-Value [Acc > NIR] : 0.1363
##
## Kappa : 0.2501
##
## Mcnemar's Test P-Value : 6.784e-06
##
## Sensitivity : 0.3182
## Specificity : 0.9005
## Pos Pred Value : 0.5957
## Neg Pred Value : 0.7414
## Prevalence : 0.3154
## Detection Rate : 0.1004
## Detection Prevalence : 0.1685
## Balanced Accuracy : 0.6094
##
## 'Positive' Class : Yes
##
Key metrics interpretation.
Accuracy. The model achieves an accuracy of approximately 71.68%, indicating that it correctly predicts the outcome for nearly three-quarters of all instances. The 95% confidence interval (0.6601, 0.7689) means we are 95% confident that the true accuracy lies within this range; while the model performs reasonably well, there may be variability in its performance, and further improvements may be needed, particularly in identifying positive instances.
Sensitivity (Recall). The model correctly identifies around 31.82% of actual positive cases (Yes), indicating that it misses a notable number of positive instances.
Specificity. The model accurately identifies approximately 90.05% of actual negatives (No), demonstrating a strong ability to avoid false positives.
Positive Predictive Value (PPV or Precision). When the model predicts a positive (Yes), it is correct 59.57% of the time, reflecting a moderate level of precision.
Negative Predictive Value (NPV). The model correctly predicts the negative class around 74.14% of the time.
The confusion matrix and associated metrics indicate that, while the model has high specificity, it lacks sensitivity. To improve its ability to detect positive cases without significantly increasing false positives, adjusting the classification threshold or applying other modeling techniques may be beneficial.
There are various methods to evaluate the performance of a prediction model, which allow for meaningful comparisons between different models. Graphically, ROC curves are commonly employed. ROC curves are ideal when the number of observations in each class is approximately equal.
📝 Remarks.
- ROC curves may present an overly optimistic view of the model’s performance on datasets with class imbalance.
- To compare models directly or evaluate their performance at different thresholds, we analyze the shapes of their curves.
The ROC curve (Receiver Operating Characteristic curve) is a useful tool for predicting the probability of a binary result. It consists of a graphical representation of \(\text{fpr}(s)\), the “false alarm rate” (plotted on the \(x\)-axis), against \(\text{tpr}(s)\), the “success rate”, also known as Recall or Sensitivity (plotted on the \(y\)-axis), across a range of threshold values \(s \in[0,1]\).
The shape of the curve conveys significant information:

- Smaller values on the \(x\)-axis indicate fewer fp and a higher tn count.
- Larger values on the \(y\)-axis indicate more tp and a lower fn count.

[The North-West Rule]
Intuitively, a point \((\text{fpr}(s),\text{tpr}(s))\) of the ROC curve representing one model is considered better than a point \((\text{fpr}(s'),\text{tpr}(s'))\) of another model if it lies to the northwest of it, meaning it has a higher true positive rate and a lower false positive rate.
📝 Remarks.
- A model with perfect skill is represented by \((\text{fpr}(s),\text{tpr}(s))=(0,1)\).
- The model such that \((\text{fpr}(s),\text{tpr}(s))=(0,0)\) systematically predicts the negative class.
- The procedure such that \((\text{fpr}(s),\text{tpr}(s))=(1,1)\) systematically predicts the positive class.
- A no-skill model is one that cannot distinguish between the classes and would predict a random class or a constant class in all cases, represented by \((\text{fpr}(s),\text{tpr}(s))= (0.5, 0.5)\).
- A no-skill model is depicted by a diagonal line from the bottom left to the top right of the plot.
The ROC curve enables us to assess how well the model discriminates between the positive and negative classes, and assists in selecting an optimal threshold that balances the rates of false positives and false negatives effectively.
📝 Remark.
An operator may plot the ROC curve for the final model and choose a threshold that achieves a desirable balance between false positives and false negatives.
The Area Under the Curve (AUC) serves as a robust metric for summarizing the performance of a classification model across all possible thresholds. By comparing the AUC values of different models, we can assess their relative strengths and weaknesses.
AUC quantifies the model’s overall ability to discriminate between positive and negative classes. AUC values range from 0 to 1: 0 indicates that all predictions are incorrect, 1 indicates that all predictions are correct, and 0.5 corresponds to a no-skill model.
Pros.
Cons.
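Both the ROC points and the AUC can be computed by hand; a base-R sketch on simulated stand-in data (`probs` and `y` are hypothetical, not the chapter's model output), using the rank interpretation of the AUC — the probability that a randomly chosen positive receives a higher score than a randomly chosen negative:

```r
# ROC points (fpr(s), tpr(s)) and AUC on stand-in data
set.seed(1)
probs <- runif(200)            # stand-in predicted probabilities
y     <- rbinom(200, 1, probs) # stand-in observed labels

roc_point <- function(s) {
  pred <- as.integer(probs >= s)
  c(fpr = sum(pred == 1 & y == 0) / sum(y == 0),
    tpr = sum(pred == 1 & y == 1) / sum(y == 1))
}
roc <- t(sapply(seq(0, 1, by = 0.01), roc_point))
plot(roc, type = "l", xlab = "fpr(s)", ylab = "tpr(s)")
abline(0, 1, lty = 2)  # diagonal: no-skill model

# AUC = P(score of a random positive > score of a random negative),
# with ties counted as 1/2
auc <- mean(outer(probs[y == 1], probs[y == 0], ">")) +
       0.5 * mean(outer(probs[y == 1], probs[y == 0], "=="))
auc
```

In practice, packages such as `ROCR` or `yardstick` compute the same quantities; this formulation makes the tie-handling and the Mann-Whitney connection explicit.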
The ROC-AUC in our setting is:
## Roc_AUC : 0.7146597
Interpretation.
In this case, the model shows limitations in predicting positive classes, particularly in the context of class imbalance. This emphasizes that while the model performs well overall (as indicated by the ROC-AUC), it struggles significantly with accurately identifying the minority class (events denoted as Yes).
📝 Remarks: Importance of Threshold Adjustment.
- When comparing models, a higher AUC indicates better performance!
- Imbalanced Classes. In datasets with class imbalance, a fixed threshold (like 0.5) may lead to suboptimal performance for the minority class. For instance, a model might predict all instances as negative to maximize precision, which is not useful for detecting the positive class.
In a linear regression setting, residuals are defined as the difference between the observed values \(y_i\) and the predicted values \(\widehat{\text{E}[Y_i\vert x_i]}=\widehat y_i\). However, in a more general context, it is possible that \(\widehat{\text{E}[Y_i\vert x_i]}\neq \widehat y_i\).
😕 In logistic regression, the outcome data is discrete, and consequently, the residuals are also discrete. Therefore, plots of raw residuals from logistic regression are generally not very informative.
1. Response residuals In logistic regression, the residuals are defined as the difference between the observed values \(y_i\) and the predicted probabilities \(\widehat{p}(x_{i})=g^{-1}(x_{i}^T\widehat\beta)\): \[ res_{response_i}:=\widehat\epsilon_{i}=y_{i}-\widehat{p}_{i}. \]
2. Pearson residuals The Pearson residuals are obtained by normalizing the response residuals \(res_{response_i}\) using the estimated variance \(\widehat{\text{Var}(Y_{i})}\), the estimated variance of \(Y_{i}\). In a logistic context, this variance is given by: \[\widehat{\text{Var}(Y_{i})}=\widehat{p}(x_{i})( 1-\widehat{p}(x_{i})).\] The Pearson residuals are then defined as: \[ res_{pearson_i} =\frac{res_{response_i}}{\sqrt{\widehat{p}(x_{i})( 1-\widehat{p}(x_{i}))}} \]
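These two definitions can be computed by hand and checked against R's `residuals()`; a minimal sketch using `mtcars` as a stand-in dataset (the model is hypothetical, not the chapter's `model_glm`):

```r
# Response and Pearson residuals by hand vs residuals()
fit   <- glm(am ~ mpg, family = binomial, data = mtcars)
p_hat <- fitted(fit)   # estimated probabilities p_hat(x_i)
y     <- mtcars$am

res_response <- y - p_hat
res_pearson  <- res_response / sqrt(p_hat * (1 - p_hat))

all.equal(unname(res_pearson),
          unname(residuals(fit, type = "pearson")))
```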
3. Standardized Pearson residuals The Standardized Pearson residuals \(res_{Stand.pearson_i}\) are calculated by normalizing the Pearson residuals \(res_{pearson_i}\) with the leverage effect: \[ res_{Stand.pearson_i} =\frac{res_{pearson_i}}{\sqrt{1-h_{ii}}}, \] where \(h_{ii}\) is the \(i^{th}\) diagonal element of the projection matrix \(H=X(X^TX)^{-1}X^T\) in the full rank setting of the matrix \(X\).
4. Deviance residuals
The summary of model_glm includes the deviance
residuals, which are suitable for generalized models. These residuals
measure how far \(\mathcal{L}_{[m]}(\widehat\beta,y)\) for
the \(i\)-th observation is
from the saturated \(\mathcal{L}_{[m_{sat}]}(\widehat\beta_{sat},y)\).
To define them:
The deviance is calculated as: \[\mathcal{D}_{[m]}=2\left( \mathcal{L}_{[m_{sat}]}(\widehat\beta_{sat},y)-\mathcal{L}_{[m]}(\widehat\beta,y) \right) \] where \(\widehat\beta\) and \(\widehat\beta_{sat}\) are the maximum likelihood estimators obtained from the models \([m]\) and \([m_ {sat}]\), respectively.
For any model \([m]\), the log-likelihood \(\mathcal{L}_{[m]}\) can be expressed as a sum of contributions from each data point: \[ \mathcal{L}_{[m]}(\beta,y)=\sum_{i=1}^n \log\left( p(x_i)^{y_i} (1- p(x_i))^{1-y_i}\right):=\sum_{i=1}^n \mathcal{L}_{[m]}(\beta,y)|_i \]
Thus, the deviance can also be expressed as a sum of contributions from each observation: \[ \mathcal{D}_{[m]}:=\sum_{i=1}^n d_i \] where \(d_i\) represents the deviance due to observation \(i\). Using this formulation, the deviance residual for observation \(i\) is defined as: \[ res_{dev_i} ={\rm sign}(y_i-\widehat{p}_i)\sqrt{ d_i}. \]
5. Standardized deviance residual
Standardized deviance residuals measure how far \(\mathcal{L}_{[m]}(\widehat\beta,y)\) for the \(i\)-th observation is from \(\mathcal{L}_{[m_{sat}]}(\widehat\beta_{sat},y)\) for the same observation, but they are normalized through the leverage effect: \[ res_{stand\_dev_i} =\frac{res_{dev_i}}{\sqrt{1-h_{ii}}}. \]
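For binary data the saturated contribution is zero (the saturated model fits each \(y_i\) exactly), so \(d_i=-2\,\mathcal{L}_{[m]}(\widehat\beta,y)|_i\) and both deviance residuals can be computed by hand and checked against `residuals()` and `rstandard()`; a sketch on `mtcars` as a stand-in dataset (the model is hypothetical, not the chapter's `model_glm`):

```r
# Deviance and standardized deviance residuals by hand
fit   <- glm(am ~ mpg, family = binomial, data = mtcars)
p_hat <- fitted(fit)
y     <- mtcars$am

# d_i = -2 * log-likelihood contribution of observation i
d_i     <- -2 * (y * log(p_hat) + (1 - y) * log(1 - p_hat))
res_dev <- sign(y - p_hat) * sqrt(d_i)

# normalize by the leverage h_ii
h <- hatvalues(fit)
res_stand_dev <- res_dev / sqrt(1 - h)

all.equal(unname(res_dev),
          unname(residuals(fit, type = "deviance")))
all.equal(unname(res_stand_dev),
          unname(rstandard(fit, type = "deviance")))
```

Note also that the \(d_i\) sum to the model deviance, `deviance(fit)`.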
📝 Remark.
If the model is “good,” we can show that the residuals are asymptotically Gaussian (this can be verified using a Q-Q plot).
[Which Residuals Should You Focus On?]
As is often the case in statistics, the answer depends on the context!
1. Residuals vs. Fitted plot
The Residuals vs. Fitted plot is typically used to check for trends in the residuals. This plot utilizes Pearson residuals.
😕 In logistic regression, the discrete nature of the residuals means that plots of raw residuals are generally not informative.
2. Scale-Location plot The Scale-Location plot is commonly employed to detect heteroskedasticity. Here, Standardized Pearson residuals \(res_{Stand.pearson_i}\) are used.
😕 Due to the nature of the response variable \(Y\), the residuals are often heteroskedastic, rendering this plot less useful.
3. Normal Q-Q plot In a linear setting, the Q-Q plot is used to assess the normality of the residuals.
😕 Although standardized Pearson residuals have an approximate standard normal distribution (as noted in Agresti, 2002), interpreting the plot can be challenging. Therefore, this Normal Q-Q plot is not particularly useful.
1. Normal Q-Q plot Ⓡ We can generate a Normal Q-Q plot based on either Standardized Pearson residuals or Standardized Deviance residuals.
residual_glm <- rstandard(model_glm, type = 'deviance')
# residual_glm <- rstandard(model_glm, type = 'pearson')
ggplot(train, aes(sample = residual_glm, col = admit)) +
  geom_qq(size = 1) + geom_qq_line(col = 'black') +
  facet_wrap(~admit) +
  ylab("Standardized deviance residuals Quantiles") +
  xlab("Theoretical Quantiles") +
  ggtitle("Normal Q-Q plot")

📝 Remarks.
- The function `geom_qq(distribution = stats::qnorm)` defaults to a normal distribution.
- The `geom_qq_line()` function adds a line to the theoretical Q-Q plot that passes through the first and third quartiles.
📈 The quantiles of our sampled data closely align with the theoretical quantiles, indicating that our residuals are approximately normally distributed.
2. An alternative to Residuals vs Fitted plot It is essential to ensure that there is no underlying structure or trend in the residuals. If a trend is detected, it may be necessary to investigate the cause, such as a poor model fit or a nonlinear relationship.
The Binned Residuals Plot serves as an alternative to the Residuals vs. Fitted plot. It divides the data into categories (bins) based on their fitted values, plotting the average residual against the average fitted value for each bin.
The plot displays the average fitted values (Expected values) versus the average Response residuals (Average residual) for each bin.

Expected residuals. The difference between the proportion of Yes responses observed among the points in a bin and the average predicted probability for those points: \[ res_{Expected_j}:=\frac{\sum_{bin_j}y_i}{\#\,bin_j}-\frac{\sum_{bin_j}\widehat{p}(x_i)}{\#\,bin_j} \] Example. For a group of 20 points with 11 Yes responses and an average prediction of 0.6, the expected residual would be \(11/20 - 0.6=-0.05\).
Average residuals. \[ res_{Average_j}:=\frac{\sum_{bin_j}(y_i-\widehat{p}(x_i))}{\#\,bin_j}=\frac{\sum_{i\in bin_j}res_{response_i}}{\#\,bin_j} \]
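The binning itself is straightforward to reproduce by hand; a sketch on `mtcars` as a stand-in dataset (hypothetical model, equal-width bins, no prediction bands):

```r
# Binned average residuals by hand; binnedplot() from the arm
# package automates this and adds 95% prediction bands
fit   <- glm(am ~ mpg, family = binomial, data = mtcars)
p_hat <- fitted(fit)
res   <- mtcars$am - p_hat   # response residuals

n_bins <- 5
bins   <- cut(p_hat, breaks = n_bins)  # equal-width bins of fitted values

avg_fitted <- tapply(p_hat, bins, mean)  # "Expected values"
avg_resid  <- tapply(res,   bins, mean)  # "Average residual"

plot(avg_fitted, avg_resid,
     xlab = "Expected values", ylab = "Average residual")
abline(h = 0, lty = 2)
```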
Ⓡ The
binnedplot function from the arm package is
used to create this binned residual plot.
📈 The graph produced by
binnedplot also displays a 95% prediction interval
(indicated by blue lines) for the mean residuals. If the binomial model
is adequate, approximately 95% of the residuals should fall within this
interval, which is indeed the case here.
📝 Remarks.
- By default, `binnedplot` selects the number of groups based on a balance between having enough points per bin (to ensure accurate average proportions) and enough bins (to observe trends if they exist). For \(n>100\), the number of groups chosen is approximately \(\sqrt n\).
- The `binnedplot` function can also be used to investigate residuals in relation to individual predictors (see Gelman, page 98, for an example).